Methods and Systems of PNDF Dual and BTF based Document Frequency Weighting Schemes

ABSTRACT

The present invention first proposes a novel expression for the pairwise positive document frequency weighting scheme and defines its symmetric dual, the negative document frequency weighting scheme. Their relation equations are derived and their globally normalized forms are provided. Their combination, the positive negative document frequency scheme, is also defined; it quantifies both a feature's capability of measuring the commonness of documents and its capability of distinguishing documents. The invention further proposes another form of the positive document frequency obtained by applying the strict proper score algorithm, and derives its dual form for the negative document frequency. The invention also defines the binary term frequency and the associated document representation methods obtained by combining it with different weighting schemes. The extension of the above schemes and term frequencies from common discrete token features to slightly more complex features such as sentences is also presented. The invention also illustrates the application details for the classical Euclidean document distance computation and the optimal transportation based document distance computation.

FIELD OF THE INVENTION

This non-provisional invention application refers to the earlier provisional patent application #63/088,430 as a continuation.

This invention considers the document frequency weighting scheme methods in classical information retrieval systems and their applications in machine learning modeling.

It relates to the prior art of Inverse Document Frequency (IDF) and the associated Term Frequency Inverse Document Frequency (TF-IDF), which have been widely used for several decades.

The invention proposes novel document weighting methods for information retrieval, which are applicable to modern machine learning frameworks for common tasks such as classification, prediction, webpage ranking and recommendation.

Particularly we illustrate the application to two scenarios, namely the classical Euclidean distance computation and the popular Optimal Transportation (OT) based document distance computation in natural language processing.

BACKGROUND OF THE INVENTION

This invention cites the earlier submitted Provisional Patent Application #63/198,209 and can be regarded as a further development in that series.

In information retrieval systems and machine learning, it is generally believed that a given data collection carries different amounts of information for different features. Some features are more informative and others less so. In other words, some features are relatively more important than others for the task under consideration. Over the past several decades researchers have developed various feature weighting methods that assign each feature a weight quantifying its importance.

One important observation made by Karen Sparck Jones in 1972 is that if a word w appears in many documents of a corpus, then the word is common and is less effective at distinguishing the documents. Conversely, if a word appears in very few documents, then the word is rare and is more effective at distinguishing documents than frequent words. Let n denote the total number of documents in the corpus and d denote the number of documents containing the word. Then

$\frac{n}{d}$

is the reciprocal of the standard document frequency. To avoid division by zero when d=0, a simple smoothing is given by

$\frac{c + n}{c + d},$

where c is a non-negative real number. Taking the default c=1 leads to the classical Inverse Document Frequency formula:

$\mathrm{IDF}(w) = \log\left( \frac{1 + n}{1 + d} \right).$

For the i-th document D_(i) and the k-th word token w_(k), one can compute the term frequency (TF), denoted as f_(ik). It is the count of the word w_(k) in the document divided by the total number of tokens in the document. The well-known Term Frequency-Inverse Document Frequency (TF-IDF) is then defined as the product of f_(ik) and the IDF, denoted as D_(i,k)=f_(ik)IDF(w_(k)).
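As a minimal sketch of the smoothed IDF and TF-IDF definitions above, assuming each document is given as a list of tokens (the function names `idf` and `tf_idf` and the toy corpus are illustrative and not part of the original disclosure):

```python
import math
from collections import Counter

def idf(n, d, c=1.0):
    """Smoothed inverse document frequency log((c + n) / (c + d))."""
    return math.log((c + n) / (c + d))

def tf_idf(docs):
    """Return one {token: f_ik * IDF(w_k)} dictionary per tokenized document."""
    n = len(docs)
    # doc_freq[w] = number of documents containing token w at least once
    doc_freq = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({w: (cnt / total) * idf(n, doc_freq[w])
                        for w, cnt in counts.items()})
    return vectors

# Hypothetical toy corpus of three tokenized documents.
corpus = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "ran", "far"]]
weights = tf_idf(corpus)
```

In this toy corpus the token "sat" occurs in one document and "cat" in two, so "sat" receives the larger weight in the first document.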

The above IDF approach comes from the point of view of measuring a feature's capability of distinguishing documents. Conversely, a simple but inspiring fact is that the more words two documents share, the more similar the two documents are. From the point of view of quantifying a feature's capability of measuring similarity between documents, Arthur Zhang recently proposed the pairwise Positive Document Frequency (PDF) and the integrated PIDF, which incorporates both the features' capability of distinguishing documents and their capability of measuring the commonness of documents.

However, the classical IDF is defined across the corpus as a global quantity, while the PDF is defined as a local quantity for a pair of documents. The integrated PIDF works well for pairwise document distance computation. The current invention first defines the normalized version of the pairwise PDF, its dual, the Negative Document Frequency (NDF), and the global form of the latter. The invention further defines the symmetric integrated Positive and Negative Document Frequency (PNDF). The invention also gives a specific algorithm for choosing the parameters in the schemes, namely the Strict Proper Score method.

This and the following several paragraphs introduce the basics of optimal transportation based document distance computation and of downstream document classification using the computed document distance. Optimal transportation (OT) is a branch of applied mathematics which studies the optimal cost of moving mass from one space to another. In the past several years it has attracted a lot of interest in the machine learning community. In 2015 Kusner and his collaborators introduced the OT technique to measure the distance between documents in natural language processing.

The framework assumes that each of two given text documents, X and Y, is regarded as a sequence of word tokens. Ignoring the word order, we can represent each document as a bag of words over the vocabulary V=[w₁, w₂, . . . , w_(n)]. Here n is the size of the vocabulary of the documents. Each document is first represented as a vector of frequency counts and then normalized by the total sum of its frequency counts. This finally gives X=[x₁, x₂, . . . , x_(n)] and Y=[y₁, y₂, . . . , y_(n)], where the two vectors have unit mass. That is, the documents can be regarded as two discrete probability distributions, which is where the machinery of optimal transportation comes into play.

Now one can transport the total mass from X to Y, moving either the whole or some portion of the mass at each point x_(i). This framework, namely balanced transportation, was first formulated by Kantorovich in 1942. The total transportation cost is naturally defined as the distance-weighted sum over all the mass moved from one space to the other. One can then ask for the optimal transportation plan:

$\mathrm{OT}(X,Y) = \min_{P} \sum_{i,j} P_{ij}\,\mathrm{Dist}(x_{i}, y_{j}) \qquad (1)$

where P ranges over all transportation plans with non-negative entries which satisfy the following constraints: Σ_(j=1) ^(n) P_(ij)=x_(i) and Σ_(i=1) ^(n) P_(ij)=y_(j). Here Dist(x_(i), y_(j)) is the distance between the word vectors of the tokens associated with x_(i) and y_(j); such word vectors are usually pretrained using popular algorithms such as Word2vec and are publicly available.
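The constraints and the cost in equation (1) can be illustrated with a small sketch that checks whether a candidate plan is feasible and evaluates its cost. It does not solve the minimization itself. The helper name `plan_cost`, the two-token vectors and the distance matrix are hypothetical values chosen only for illustration.

```python
def plan_cost(P, x, y, dist, tol=1e-9):
    """Validate the constraints on plan P and return its total transport cost."""
    rows, cols = len(x), len(y)
    if any(P[i][j] < 0 for i in range(rows) for j in range(cols)):
        raise ValueError("entries of P must be non-negative")
    for i in range(rows):
        if abs(sum(P[i]) - x[i]) > tol:
            raise ValueError("row sums of P must equal x")
    for j in range(cols):
        if abs(sum(P[i][j] for i in range(rows)) - y[j]) > tol:
            raise ValueError("column sums of P must equal y")
    return sum(P[i][j] * dist[i][j] for i in range(rows) for j in range(cols))

x = [0.5, 0.5]                      # normalized frequencies of document X
y = [0.25, 0.75]                    # normalized frequencies of document Y
dist = [[0.0, 1.0], [1.0, 0.0]]     # hypothetical word-vector distances

plan_a = [[0.25, 0.25], [0.0, 0.5]]            # moves only the surplus mass
plan_b = [[xi * yj for yj in y] for xi in x]   # independent coupling x_i * y_j

cost_a = plan_cost(plan_a, x, y, dist)
cost_b = plan_cost(plan_b, x, y, dist)
```

Both plans satisfy the marginal constraints, but `plan_a` is cheaper than `plan_b`; the OT distance is the smallest such cost over all feasible plans.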

Similarly, at the sentence level, we can regard each sentence, rather than each word, as an individual feature. We can count the sentence frequencies in each document and form the normalized sentence vector representation of each document. That is, X=[sx₁, sx₂, . . . , sx_(m)] and Y=[sy₁, sy₂, . . . , sy_(m)]. Here m is the total number of different sentences in the two documents. We then have a similar OT formulation:

$\mathrm{OT}(X,Y) = \min_{P} \sum_{i,j} P_{ij}\,\mathrm{Dist}(sx_{i}, sy_{j}) \qquad (2)$

where P ranges over all transportation plans which satisfy the following constraints: Σ_(j=1) ^(m) P_(ij)=sx_(i) and Σ_(i=1) ^(m) P_(ij)=sy_(j). The sentence vectors for sx_(i) and sy_(j) are formed from the weighted word vectors of all the words in the sentence, where the weight type for each word is identical to the selected feature document frequency type. Dist(sx_(i), sy_(j)) is the vector distance between the sentence vectors for sx_(i) and sy_(j).

This paragraph reviews the classical Euclidean distance computation for a pair of documents. Following the notation above, X=[x₁, x₂, . . . , x_(n)] and Y=[y₁, y₂, . . . , y_(n)], where x_(k) and y_(k) are the frequencies of the word token w_(k) in the two documents. The classical Euclidean distance between documents X and Y, denoted as Dist_(XY), is then given as

$\mathrm{Dist}_{XY} = \sqrt{\sum_{k = 1}^{n} \left( x_{k} - y_{k} \right)^{2}} \qquad (3)$

SUMMARY AND OBJECTS OF THE INVENTION

The invention first proposes a general form for the pairwise Positive Document Frequency (PDF) and its symmetric dual, the pairwise Negative Document Frequency (NDF). Similar to the IDF, this NDF assigns to each pair of documents a metric which accounts for the feature's capability of distinguishing the pair of documents. The invention further gives the normalized PDF and NDF for a document across the corpus by first summing over all the possible pairs and then taking the average.

Next the invention proposes an integrated weighting scheme, namely Positive and Negative Document Frequency (PNDF), by combining the PDF and NDF together. Both the local pairwise form and the global form across the corpus are given. The local pairwise form works naturally for the distance between each pair of documents, while the global form applies as the weighting for each document. The proposed PNDF has the dual capability of assessing similarity with PDF and distinguishing with NDF. The normalized version of PNDF is a global weighting scheme, and the associated TF-PNDF gives a simple linear-complexity representation of documents.

Among the numerous formula expression choices for PDF and NDF, the invention also proposes a Strict Proper Score Algorithm method for selecting the suitable formula forms for them and derives the final forms for PNDF.

The invention also proposes a novel Binary Term Frequency (BTF) which only incorporates the presence status of a feature in a document. Its natural combinations with IDF, PDF, NDF and PNDF are also given.

The natural extension of the above weighting schemes to sentence-like complex features is also given, as the sum of the weights of all the word tokens or symbols in the sentence-level feature.

The proposed schemes PDF, NDF, PNDF, BTF and IDF, as well as their various combinations, can easily be applied as weighting methods to metrics based on either a pair of documents or a single document, for downstream information retrieval and machine learning tasks. Specifically, for the optimal transportation based document distance computation and the Euclidean document distance, we illustrate the procedures for applying these weightings to word tokens or sentence-like features.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 gives a brief summary of PNDF weighting procedure for pairwise weighting and normalized version for a document representation.

DETAILED DESCRIPTION OF THE INVENTION

Let us first fix the notation. For a corpus of documents, we use n to denote the total number of documents. For a pair of documents, we use the indexes i and j and denote the documents as D_(i) and D_(j). For token or symbol features, we use w to denote a generic word. We use m to denote the total number of features. The k-th feature is denoted as w_(k), where k ∈ {1, . . . , m}. The number of documents in the corpus that contain a generic feature w_(k) is the classical document frequency, denoted as d_(k) ∈ {0, . . . , n}. The total count of a token w_(k) in a document D_(i) is denoted as c_(ik), and the corresponding term frequency is denoted as f_(ik). It is the ratio of c_(ik) to the total token count of the document. That is,

$f_{ik} = \frac{c_{ik}}{\sum_{t = 1}^{m} c_{it}} \qquad (4)$

First recall that, for a given token feature w_(k), Arthur Zhang introduced in the prior Provisional Patent Application #63/198,209 a quantity called the Positive Document Frequency (PDF) for each pair of documents, which summarizes the contribution of the feature w_(k) to the similarity of the two documents. For a pair of documents, let d denote the number of documents in the pair that contain the feature w_(k), so that d ∈ {0,1,2}. The PDF then has the general form

$\mathrm{PDF}(w) = \begin{cases} \gamma_{2} & \text{if } d = 2 \\ \gamma_{1} & \text{if } d = 1 \\ \gamma_{0} & \text{if } d = 0 \end{cases} \qquad (5)$

where γ₀, γ₁, and γ₂ are real numbers. There are numerous ways to define these numbers in terms of d, n and the document frequency d_(k).

By taking the ratios of γ₂ and γ₀ to γ₁, we can reduce the number of parameters and simplify the formula as follows, where the symbols γ₁ and γ₂ are reused for the two remaining parameters:

$\mathrm{PDF}(w) = \begin{cases} 1 + \gamma_{1} & \text{if } d = 2 \\ 1 & \text{if } d = 1 \\ 1 + \gamma_{2} & \text{if } d = 0 \end{cases} \qquad (6)$

where γ₁ and γ₂ are two real numbers. γ₁ quantifies the extra effect when the feature appears in both documents, while γ₂ quantifies the effect of the feature being absent from both documents. For example,

$\gamma_{1} = \log\frac{c + 2}{c + 1} \quad \text{and} \quad \gamma_{2} = \log\frac{c}{c + 1}$

with c=1.
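A minimal sketch of the pairwise PDF of equation (6) with the example parameters above, assuming each document is represented as a set of tokens (the function name `pairwise_pdf` and the sample documents are hypothetical):

```python
import math

def pairwise_pdf(token, doc_i, doc_j, c=1.0):
    """Pairwise PDF of equation (6) for one token and one pair of documents."""
    gamma1 = math.log((c + 2) / (c + 1))   # extra weight when the token is shared
    gamma2 = math.log(c / (c + 1))         # adjustment when the token is in neither
    d = int(token in doc_i) + int(token in doc_j)
    if d == 2:
        return 1 + gamma1
    if d == 1:
        return 1.0
    return 1 + gamma2

doc_a = {"cat", "sat", "mat"}
doc_b = {"cat", "ran"}

shared = pairwise_pdf("cat", doc_a, doc_b)    # token in both documents
single = pairwise_pdf("sat", doc_a, doc_b)    # token in one document only
missing = pairwise_pdf("dog", doc_a, doc_b)   # token in neither document
```

With c=1 the three values are ordered shared > single > missing, so a shared token contributes the most to the similarity weight.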

For the i-th document D_(i) by iterating through all the documents in the corpus we can sum up these pairwise PDF and then take the average. This gives the normalized PDF, denoted as nPDF_(i)(w_(k)). The term has two different expressions depending upon the presence of the feature or not in the document. When the token feature w_(k) appears in document D_(i), the normalized PDF has the following expression.

$\begin{aligned} \mathrm{nPDF}_{i}(w_{k}) &= \frac{1}{n}\left[ d_{k}\left( 1 + \gamma_{1} \right) + n - d_{k} \right] \\ &= 1 + \frac{d_{k}}{n}\gamma_{1} \end{aligned} \qquad (7)$

When the token feature w_(k) does not appear in document D_(i), the normalized PDF has the following expression.

$\begin{aligned} \mathrm{nPDF}_{i}(w_{k}) &= \frac{1}{n}\left[ d_{k} + \left( n - d_{k} \right)\left( 1 + \gamma_{2} \right) \right] \\ &= 1 + \left( 1 - \frac{d_{k}}{n} \right)\gamma_{2} \end{aligned} \qquad (8)$

The parameters γ₁ and γ₂ can be chosen in many ways. For example, the choice γ₁=log(3/2) and γ₂=log(1/2) works well empirically in our experiments.
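The closed forms (7) and (8) can be checked numerically against a direct average of the pairwise PDF. The sketch below assumes, as the counts d_(k) and n in those equations imply, that document D_(i) is paired with every document in the corpus including itself. The presence pattern and the function names are hypothetical.

```python
import math

G1, G2 = math.log(3 / 2), math.log(1 / 2)   # example parameters from the text

def pairwise_pdf(in_i, in_j):
    """Equation (6), given whether the feature is present in each document."""
    d = int(in_i) + int(in_j)
    return {2: 1 + G1, 1: 1.0, 0: 1 + G2}[d]

def npdf_averaged(i, presence):
    """Average the pairwise PDF of document i over all n documents."""
    n = len(presence)
    return sum(pairwise_pdf(presence[i], p_j) for p_j in presence) / n

def npdf_closed_form(i, presence):
    """Equation (7) if the feature is in document i, otherwise equation (8)."""
    n, d_k = len(presence), sum(presence)
    if presence[i]:
        return 1 + d_k / n * G1
    return 1 + (1 - d_k / n) * G2

# presence[j] is True when the feature w_k occurs in document j.
presence = [True, True, False, True, False]
averaged = [npdf_averaged(i, presence) for i in range(len(presence))]
closed = [npdf_closed_form(i, presence) for i in range(len(presence))]
```

The two lists agree up to floating-point rounding, and every document that contains the feature receives the same normalized weight.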

The invention defines the symmetric dual of the pairwise PDF, namely Negative Document Frequency (NDF), to be the following form:

$\mathrm{NDF}_{ij}(w_{k}) = \begin{cases} 2 + \gamma_{1} - y & \text{if } w_{k} \text{ is in 2 docs} \\ y & \text{if } w_{k} \text{ is in 1 doc} \\ 2 + \gamma_{2} - y & \text{if } w_{k} \text{ is in 0 docs} \end{cases} \qquad (9)$

where the parameter y is a non-negative real number, and the two documents are the i-th and j-th in the collection. Similar to IDF, it is a pairwise local metric to quantify the feature w_(k)'s capability of distinguishing the documents.

Finally, let x, y and z denote the values assigned to the first, second and last cases, respectively, i.e., to the feature appearing in two, one and zero documents of the pair. Then the PDF and its dual NDF both satisfy the following two relation equations, which describe their dynamics.

$2 + \gamma_{1} = x + y, \qquad 2 + \gamma_{2} = y + z \qquad (10)$

Similar to the PDF, by pairing one document with every document in the corpus, the NDF also has two normalized expressions across the corpus. When the token feature w_(k) appears in document D_(i), the normalized NDF has the following expression.

$\begin{aligned} \mathrm{nNDF}_{i}(w_{k}) &= \frac{1}{n}\left[ d_{k}\left( 2 + \gamma_{1} - y \right) + \left( n - d_{k} \right) y \right] \\ &= \frac{d_{k}}{n}\left( 2 + \gamma_{1} \right) + \left( 1 - \frac{2d_{k}}{n} \right) y \end{aligned} \qquad (11)$

When the token feature w_(k) does not appear in document D_(i), the normalized NDF has the following expression.

$\begin{aligned} \mathrm{nNDF}_{i}(w_{k}) &= \frac{1}{n}\left[ d_{k} y + \left( n - d_{k} \right)\left( 2 + \gamma_{2} - y \right) \right] \\ &= \left( 1 - \frac{d_{k}}{n} \right)\left( 2 + \gamma_{2} \right) - \left( 1 - \frac{2d_{k}}{n} \right) y \end{aligned} \qquad (12)$

The invention defines the pairwise Positive and Negative Document Frequency (PNDF) to be the sum of PDF and NDF. That is,

PNDF_(ij)(w _(k))=PDF_(ij)(w _(k))+NDF_(ij)(w _(k))   (13)

Similarly, the global normalized PNDF is defined to be the sum of the normalized PDF and NDF.

nPNDF_(i)(w _(k))=nPDF_(i)(w _(k))+nNDF_(i)(w _(k))   (14)
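As a sketch of how the pairwise and normalized PNDF fit together, the code below averages the pairwise sum PDF + NDF over the corpus and compares it with the sum of the normalized closed forms (7), (8), (11) and (12). It assumes the first-case NDF value 2+γ₁−y that appears in equation (11), the example values γ₁=log(3/2) and γ₂=log(1/2), and a hypothetical presence pattern and parameter y.

```python
import math

G1, G2 = math.log(3 / 2), math.log(1 / 2)   # example parameters from the text

def pairwise_pndf(in_i, in_j, y):
    """Pairwise PDF + NDF for one feature and one pair of documents."""
    d = int(in_i) + int(in_j)
    pdf = {2: 1 + G1, 1: 1.0, 0: 1 + G2}[d]
    ndf = {2: 2 + G1 - y, 1: y, 0: 2 + G2 - y}[d]
    return pdf + ndf

def npndf_closed_form(present, d_k, n, y):
    """Normalized PNDF as the sum of the normalized PDF and NDF."""
    r = d_k / n
    if present:
        npdf = 1 + r * G1
        nndf = r * (2 + G1) + (1 - 2 * r) * y
    else:
        npdf = 1 + (1 - r) * G2
        nndf = (1 - r) * (2 + G2) - (1 - 2 * r) * y
    return npdf + nndf

presence = [True, False, True, True, False, False, False]
y = 0.3                                     # hypothetical non-negative parameter
n, d_k = len(presence), sum(presence)

averaged = [sum(pairwise_pndf(p_i, p_j, y) for p_j in presence) / n
            for p_i in presence]
closed = [npndf_closed_form(p_i, d_k, n, y) for p_i in presence]
```

Because averaging is linear, averaging the pairwise PNDF gives the same value as adding the two normalized weights, which is what the comparison of `averaged` and `closed` shows.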

In the PDF and NDF formulas above, there is plenty of flexibility in selecting the parameters. The invention proposes a specific choice obtained by applying the Strict Proper Score Algorithm to a pair of documents. The strict proper score algorithm is a scoring method which assigns to each of a set of mutually exclusive conditions the logarithm of the inverse of its probability. Let us first fix some notation.

With d = d_(k) denoting the document frequency of the feature w_(k), let

$\gamma_{1} = \log\left( \frac{n}{1 + d} \right), \quad \gamma_{2} = \log\left( \frac{n}{1 + n - d} \right), \qquad (15)$

then we define the pairwise PDF as follows:

$\mathrm{PDF}_{ij}(w_{k}) = \begin{cases} 2\gamma_{1} & \text{if } w_{k} \text{ appears in both docs } i \text{ and } j \\ \log\frac{1}{2} + \gamma_{1} + \gamma_{2} & \text{if } w_{k} \text{ appears in only one doc} \\ 2\gamma_{2} & \text{if } w_{k} \text{ appears in neither } i \text{ nor } j \end{cases} \qquad (16)$

Correspondingly, with a real parameter δ, the NDF is defined as

$\mathrm{NDF}_{ij}(w_{k}) = \begin{cases} 2\gamma_{1} - \delta & \text{if } w_{k} \text{ appears in both docs } i \text{ and } j \\ \log\frac{1}{2} + \gamma_{1} + \gamma_{2} + \delta & \text{if } w_{k} \text{ appears in only one doc} \\ 2\gamma_{2} - \delta & \text{if } w_{k} \text{ appears in neither } i \text{ nor } j \end{cases} \qquad (17)$

With the specific parameter values above, the corresponding normalized PDF and NDF weights have the following two different expressions according to the presence status of the feature w_(k). Let

$\mathrm{entropy}(w) = \gamma_{1}\frac{d}{n} + \gamma_{2}\frac{n - d}{n}.$

The corresponding global forms can easily be derived as follows

$\begin{aligned} \text{Case } w \in D_{i}: \quad \mathrm{nPDF}_{i}^{1} &= \mathrm{entropy}(w) + \gamma_{1} + \left( 1 - \frac{d_{k}}{n} \right)\log\left( \frac{1}{2} \right), \\ \mathrm{nNDF}_{i}^{1} &= \mathrm{nPDF}_{i}^{1} + \left( 1 - \frac{2d_{k}}{n} \right)\delta \qquad (18) \\ \text{Case } w \notin D_{i}: \quad \mathrm{nPDF}_{i}^{0} &= \mathrm{entropy}(w) + \gamma_{2} + \frac{d_{k}}{n}\log\left( \frac{1}{2} \right), \\ \mathrm{nNDF}_{i}^{0} &= \mathrm{nPDF}_{i}^{0} - \left( 1 - \frac{2d_{k}}{n} \right)\delta \qquad (19) \end{aligned}$

The two dual schemes also have the following relation:

$\mathrm{nPDF}_{i}^{1} + \mathrm{nPDF}_{i}^{0} = \mathrm{nNDF}_{i}^{1} + \mathrm{nNDF}_{i}^{0}. \qquad (20)$
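The global forms (18) and (19) and the relation (20) can be evaluated with a short sketch. It assumes that d in equation (15) is the corpus document frequency d_(k); the function name `sps_weights` and the values n=10, d_(k)=3 and δ=0.2 are hypothetical.

```python
import math

def sps_weights(n, d_k, delta):
    """Normalized PDF/NDF of equations (18)-(19) with the parameters of (15)."""
    g1 = math.log(n / (1 + d_k))
    g2 = math.log(n / (1 + n - d_k))
    r = d_k / n
    entropy = g1 * r + g2 * (1 - r)
    log_half = math.log(0.5)
    npdf1 = entropy + g1 + (1 - r) * log_half   # feature present in D_i
    npdf0 = entropy + g2 + r * log_half         # feature absent from D_i
    nndf1 = npdf1 + (1 - 2 * r) * delta
    nndf0 = npdf0 - (1 - 2 * r) * delta
    return npdf1, npdf0, nndf1, nndf0

npdf1, npdf0, nndf1, nndf0 = sps_weights(n=10, d_k=3, delta=0.2)

# Relation (20): the delta terms cancel when the two cases are added.
lhs = npdf1 + npdf0
rhs = nndf1 + nndf0
```

The values of `lhs` and `rhs` coincide up to rounding, since δ enters the present and absent cases with opposite signs.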

Note that the weights above have two forms depending on whether a feature is present in or absent from a document. This phenomenon is caused by the introduction of dual document frequencies here, so the absence of a feature also carries useful information. In order to capture such absence information, one needs to redefine the feature to be the binary indicator of the absence of a token w. The corresponding term frequency and document frequency representation vector is regarded as an additional component in the document distance computation.

The invention proposes a novel Binary Term Frequency (BTF) of a token feature w_(k) in a document D_(i) to be the following

$\mathrm{BTF}_{i}(w_{k}) = \begin{cases} 1 & \text{if the feature count } d > 0 \text{ in doc } D_{i} \\ 0 & \text{if the feature count } d = 0 \text{ in doc } D_{i} \end{cases} \qquad (21)$

where the parameter d is the w_(k) feature count in document D_(i). This simplified term frequency essentially indicates the presence status of features in a document, by ignoring the term frequency magnitude.

Following the well-known TF-IDF spirit, the invention further proposes the Binary Term Frequency Inverse Document Frequency (BTF-IDF) as the multiplication of BTF with IDF. That is, for token feature w_(k) in document D_(i), we have the following formula:

BTF-IDF_(i)(w _(k))=BTF_(i)(w _(k))·IDF(w _(k))   (22)

Thus documents can be represented as a vector of such coordinates.

D_(i)=[BTF-IDF_(i)(w ₁), . . . , BTF-IDF_(i)(w _(m))]  (23)

For two documents D_(i) and D_(j), the Euclidean distance between such vectors is denoted as Dist_(btfidf)(D_(i), D_(j)).

$\mathrm{Dist}_{btfidf}(D_{i}, D_{j}) = \sqrt{\sum_{k = 1}^{m} \left( \text{BTF-IDF}_{i}(w_{k}) - \text{BTF-IDF}_{j}(w_{k}) \right)^{2}} \qquad (24)$
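A minimal sketch of the BTF-IDF representation and its Euclidean distance, assuming that BTF is 1 when the feature count is strictly positive and 0 otherwise, and using the smoothed IDF with c=1 from the background section (function names and the toy corpus are hypothetical):

```python
import math

def btf_idf_vectors(docs, vocabulary):
    """One BTF-IDF vector per document, ordered by the given vocabulary."""
    n = len(docs)
    doc_sets = [set(doc) for doc in docs]
    # Smoothed IDF log((1 + n) / (1 + d_k)) for every vocabulary entry.
    idf = {w: math.log((1 + n) / (1 + sum(w in s for s in doc_sets)))
           for w in vocabulary}
    # BTF is 1.0 when the token is present in the document, else 0.0.
    return [[(1.0 if w in s else 0.0) * idf[w] for w in vocabulary]
            for s in doc_sets]

def euclidean(u, v):
    """Classical Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

docs = [["cat", "cat", "sat"], ["cat", "ran"], ["dog", "ran"]]
vocab = sorted({w for doc in docs for w in doc})
vecs = btf_idf_vectors(docs, vocab)
dist_01 = euclidean(vecs[0], vecs[1])
```

Because only presence matters, the repeated token "cat" in the first document does not change its vector: replacing that document by ["cat", "sat"] yields the same representation.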

Similarly, the invention further proposes the Binary Term Frequency Positive Document Frequency (BTF-PDF) as the multiplication of BTF with the normalized PDF. That is, for token feature w_(k) in document D_(i) we have the following formula:

BTF-PDF_(i)(w _(k))=BTF_(i)(w _(k))·PDF(w _(k))   (25)

Thus documents can be represented as a vector of such coordinates.

D_(i)=[BTF-PDF_(i)(w ₁), . . . , BTF-PDF_(i)(w _(m))]  (26)

For two documents D_(i) and D_(j), the Euclidean distance between such vectors is denoted as Dist_(btfpdf)(D_(i), D_(j)).

$\mathrm{Dist}_{btfpdf}(D_{i}, D_{j}) = \sqrt{\sum_{k = 1}^{m} \left( \text{BTF-PDF}_{i}(w_{k}) - \text{BTF-PDF}_{j}(w_{k}) \right)^{2}} \qquad (27)$

Similarly, the invention further proposes the Binary Term Frequency Negative Document Frequency (BTF-NDF) as the multiplication of BTF with the normalized NDF. That is, for token feature w_(k) in document D_(i), we have the following formula:

BTF-NDF_(i)(w _(k))=BTF_(i)(w _(k))·NDF(w _(k))   (28)

Thus documents can be represented as a vector of such coordinates.

D_(i)=[BTF-NDF_(i)(w ₁), . . . , BTF-NDF_(i)(w _(m))]  (29)

For two documents D_(i) and D_(j), the Euclidean distance between such vectors is denoted as Dist_(btfndf)(D_(i), D_(j)).

$\mathrm{Dist}_{btfndf}(D_{i}, D_{j}) = \sqrt{\sum_{k = 1}^{m} \left( \text{BTF-NDF}_{i}(w_{k}) - \text{BTF-NDF}_{j}(w_{k}) \right)^{2}} \qquad (30)$

Similarly, the invention further proposes the Binary Term Frequency Positive Negative Document Frequency (BTF-PNDF) as the multiplication of BTF with the normalized PNDF. That is, for token feature w_(k) in document D_(i) we have the following formula:

BTF-PNDF_(i)(w _(k))=BTF_(i)(w _(k))·PNDF(w _(k))   (31)

Thus documents can be represented as a vector of such coordinates.

D_(i)=[BTF-PNDF_(i)(w ₁), . . . , BTF-PNDF_(i)(w _(m))]  (32)

For two documents D_(i) and D_(j), the Euclidean distance between such vectors is denoted as Dist_(btfpndf)(D_(i), D_(j)).

$\mathrm{Dist}_{btfpndf}(D_{i}, D_{j}) = \sqrt{\sum_{k = 1}^{m} \left( \text{BTF-PNDF}_{i}(w_{k}) - \text{BTF-PNDF}_{j}(w_{k}) \right)^{2}} \qquad (33)$

To leverage the above document distances emphasizing the presence status of features, the invention proposes including the BTF with suitable consistent document frequency based Euclidean distance components in the document distance computation when using various weighting schemes discussed above.

The invention further generalizes the pairwise PDF, NDF, PNDF and their normalized variants to sentences or short phrases by summing the individual token weights in the corresponding sentence or phrase. We denote the sentence-level PDF weighting scheme as Sentence Positive Document Frequency (SPDF). Let s=w₁w₂ . . . w_(k), where the w_(i) are non-stop words; then

$\mathrm{SPDF}(s) = \sum_{i = 1}^{k} \mathrm{PDF}(w_{i}) \qquad (34)$

where the PDF can be the pairwise PDF or its normalized variants depending on the application context.

Similarly, the invention extends the definition of the pairwise NDF and its normalized variant to sentences or short phrases, and denotes the sentence-level NDF as SNDF.

$\mathrm{SNDF}(s) = \sum_{i = 1}^{k} \mathrm{NDF}(w_{i}) \qquad (35)$

Similarly, the invention proposes the Positive Negative Document Frequency (PNDF) to be the natural summation of PDF and NDF. For sentences or short phrases, the generalized SPNDF is defined to be the accumulated sum of PNDF over tokens in the corresponding sentence or phrase, which is also equivalent to the sum of SPDF and SNDF.

$\mathrm{SPNDF}(s) = \sum_{i = 1}^{k} \mathrm{PNDF}(w_{i}) = \mathrm{SPDF}(s) + \mathrm{SNDF}(s) \qquad (36)$
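The sentence-level sums (34) to (36) can be sketched as follows, assuming the token-level weights have already been computed and are supplied as dictionaries. The weight values, the stop-word list and the helper name `sentence_weight` are hypothetical.

```python
def sentence_weight(sentence_tokens, token_weight, stop_words=frozenset()):
    """Sum the token-level weights over the non-stop words of a sentence."""
    return sum(token_weight[w] for w in sentence_tokens if w not in stop_words)

# Hypothetical token-level weights for one pair of documents.
pdf = {"quick": 1.5, "fox": 1.25}
ndf = {"quick": 0.5, "fox": 0.75}
pndf = {w: pdf[w] + ndf[w] for w in pdf}    # PNDF = PDF + NDF per token

sentence = ["the", "quick", "fox"]
stops = frozenset({"the"})

spdf = sentence_weight(sentence, pdf, stops)
sndf = sentence_weight(sentence, ndf, stops)
spndf = sentence_weight(sentence, pndf, stops)
```

Summing the token-level PNDF gives the same value as adding SPDF and SNDF, as stated in equation (36).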

Both the above pairwise weighting schemes and their normalized variants can be applied straightforwardly to any pairwise document distance computation. The OT based document distance requires solving an optimization problem at a cost of O(n³ log(n)) complexity. For the classical Euclidean distance computation, all the normalized weighting schemes introduced above can be applied directly to the features. With a normalized scheme, each document needs only a single coordinate-wise multiplication by the weights, whereas using pairwise weights requires a different multiplication for each pair of documents. The classical Euclidean distance computation has only linear complexity. Detailed demonstrations for the two scenarios, namely the optimal transportation based word token or sentence moving distance and the Euclidean distance for text documents, are given below.

The corpus of documents here is very generic. For example, a document could be a webpage, a news article, a Facebook message, etc. A feature could be a word, a generic token or symbol, or something slightly more complex such as a sentence or a short phrase.

As an illustration, we show how the weighting schemes can be applied to the classical Euclidean distance computation and the optimal transportation based document distance computation. We use the same notation as in the background section above. At the word token level, the normalized word frequency vectors D_(i)=[f_(i1), f_(i2), . . . , f_(im)] and D_(j)=[f_(j1), f_(j2), . . . , f_(jm)] represent document D_(i) and document D_(j). At the sentence level we get the similar representations D_(i)=[sf_(i1), sf_(i2), . . . , sf_(iM)] and D_(j)=[sf_(j1), sf_(j2), . . . , sf_(jM)], where M is the total number of different sentences or phrases in the two documents and sf_(ik) denotes the normalized frequency count of the k-th sentence in document D_(i).

For the classical Euclidean distance computation for a pair of documents, we use PDF_(ij)(w) to denote the pairwise PDF of word token w for document D_(i) and document D_(j). Multiplying each frequency by the corresponding PDF or its variant gives D_(i)=[f_(i1)PDF_(ij)(w₁), . . . , f_(im)PDF_(ij)(w_(m))] and D_(j)=[f_(j1)PDF_(ij)(w₁), . . . , f_(jm)PDF_(ij)(w_(m))].

Recall the normalized weight schemes PDF and NDF have two forms for the presence or absence of a feature in a document. The default presence form contains most information while the absence form also contains some useful information. In the Euclidean distance computation below, a document can have both components and the distance between a pair of documents is the sum of the two corresponding component distances.

For the classical Euclidean distance computation with normalized weight schemes, the normalized weight for features present in a document gives the document representation D_(i)=[f_(i1)PDF(w₁), . . . , f_(im)PDF(w_(m))], where PDF(w_(k)) is the default value when the feature is present in a document. This document representation can be used for the distance computation with any other document in the corpus, without the need for further weight updates of the coordinates. This global weight therefore has an advantage in computation cost. In the same spirit, the invention proposes the normalized weights of PDF, NDF and PNDF for the representation. Thus NDF gives

D_(i) =[f _(i1)NDF(w ₁), . . . , f _(im)NDF(w _(m))], and PNDF gives

D_(i) =[f _(i1)PNDF(w ₁), . . . , f _(im)PNDF(w _(m))].

Note that the normalized weights also have a value when a feature is absent from a document. In order to use such information, the invention proposes first computing the negative term frequencies, denoted as nf, by checking whether each feature is present in a document. That is, if a feature is present in a document, nf(w_(k))=0; otherwise nf(w_(k))=1. Thus the normalized weight for the absence of features gives each document the representation D_(i)=[nf_(i1)PDF(w₁), . . . , nf_(im)PDF(w_(m))], where PDF(w_(k)) here is the default value when the feature is absent from a document. Similarly NDF gives D_(i)=[nf_(i1)NDF(w₁), . . . , nf_(im)NDF(w_(m))], and PNDF gives D_(i)=[nf_(i1)PNDF(w₁), . . . , nf_(im)PNDF(w_(m))].

The classical Euclidean distance between documents X=[x₁, . . . , x_(m)] and Y=[y₁, . . . , y_(m)], denoted as Dist_(XY), is then given as

$\mathrm{Dist}_{XY} = \sqrt{\sum_{k = 1}^{m} \left( x_{k} - y_{k} \right)^{2}} \qquad (37)$

This equation gives the distance between documents weighted using any of IDF, PDF, NDF or PNDF.
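The presence and absence components described above can be sketched as follows. The example uses the normalized PDF forms (7) and (8) with the example values γ₁=log(3/2) and γ₂=log(1/2) as the presence and absence weights, and adds the two component distances as the text describes. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

G1, G2 = math.log(3 / 2), math.log(1 / 2)   # example parameters from the text

def two_component_vectors(docs, vocabulary):
    """Return (presence, absence) weighted vectors for every document."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    w_present = [1 + df[w] / n * G1 for w in vocabulary]        # equation (7)
    w_absent = [1 + (1 - df[w] / n) * G2 for w in vocabulary]   # equation (8)
    present, absent = [], []
    for doc in docs:
        counts, total = Counter(doc), len(doc)
        present.append([counts[w] / total * wp
                        for w, wp in zip(vocabulary, w_present)])
        # Negative term frequency nf: 0 if the token is present, else 1.
        absent.append([(0.0 if counts[w] else 1.0) * wa
                       for w, wa in zip(vocabulary, w_absent)])
    return present, absent

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def two_component_distance(i, j, present, absent):
    """Sum of the presence-component and absence-component distances."""
    return euclidean(present[i], present[j]) + euclidean(absent[i], absent[j])

docs = [["cat", "cat", "sat"], ["cat", "ran"], ["dog", "ran"]]
vocab = sorted({w for doc in docs for w in doc})
present, absent = two_component_vectors(docs, vocab)
d01 = two_component_distance(0, 1, present, absent)
```

Each document is weighted once and can then be compared with any other document, which is the computational advantage of the normalized schemes noted in the text.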

For the Euclidean distance computation, the invention proposes combining the PNDF weight based distance computed with the usual term frequencies and the PNDF weight based distance computed with the BTF.

Note that the BTF-PNDF gives D_(i) ^(b)=[btf_(i1)PNDF(w₁), . . . , btf_(im)PNDF(w_(m))], while the TF-PNDF gives D_(i)=[f_(i1)PNDF(w₁), . . . , f_(im)PNDF(w_(m))]. The combined distance is

$\begin{aligned} \mathrm{Dist}_{ij} &= \sqrt{\mathrm{dist}\left( D_{i}^{b}, D_{j}^{b} \right)^{2} + \mathrm{dist}\left( D_{i}, D_{j} \right)^{2}} \\ &= \sqrt{\sum_{k = 1}^{m} \mathrm{PNDF}^{2}(w_{k})\left[ \left( f_{ik} - f_{jk} \right)^{2} + \left( btf_{ik} - btf_{jk} \right)^{2} \right]} \end{aligned} \qquad (38)$
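The two lines of equation (38) can be compared numerically with a short sketch. The term frequency vectors and the per-feature weights below are hypothetical placeholders for the normalized PNDF values, and the function names are illustrative.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def binary(freqs):
    """Binary term frequency: 1.0 where the term frequency is positive."""
    return [1.0 if f > 0 else 0.0 for f in freqs]

def combined_by_parts(f_i, f_j, weight):
    """First line of (38): combine the BTF and TF component distances."""
    tf_i = [w * f for w, f in zip(weight, f_i)]
    tf_j = [w * f for w, f in zip(weight, f_j)]
    b_i = [w * b for w, b in zip(weight, binary(f_i))]
    b_j = [w * b for w, b in zip(weight, binary(f_j))]
    return math.hypot(euclidean(b_i, b_j), euclidean(tf_i, tf_j))

def combined_single_sum(f_i, f_j, weight):
    """Second line of (38): one weighted sum over the features."""
    return math.sqrt(sum(w * w * ((a - b) ** 2 + (p - q) ** 2)
                         for w, a, b, p, q in zip(weight, f_i, f_j,
                                                  binary(f_i), binary(f_j))))

f_i = [0.5, 0.5, 0.0]        # term frequencies of document D_i
f_j = [0.25, 0.0, 0.75]      # term frequencies of document D_j
weight = [1.2, 0.8, 2.0]     # hypothetical PNDF(w_k) values

by_parts = combined_by_parts(f_i, f_j, weight)
single_sum = combined_single_sum(f_i, f_j, weight)
```

Both functions return the same value up to rounding; the single-sum form needs only one pass over the features.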

For the Euclidean distance computation, the invention also proposes the simpler combination of the IDF weight based distance computed with the usual term frequencies and the IDF weight based distance computed with the BTF.

Note that the BTF-IDF gives D_(i) ^(b)=[btf_(i1)IDF(w₁), . . . , btf_(im)IDF(w_(m))], while the TF-IDF gives D_(i)=[f_(i1)IDF(w₁), . . . , f_(im)IDF(w_(m))]. The combined distance is

$\begin{aligned} \mathrm{Dist}_{ij} &= \sqrt{\mathrm{dist}\left( D_{i}^{b}, D_{j}^{b} \right)^{2} + \mathrm{dist}\left( D_{i}, D_{j} \right)^{2}} \\ &= \sqrt{\sum_{k = 1}^{m} \mathrm{IDF}^{2}(w_{k})\left[ \left( f_{ik} - f_{jk} \right)^{2} + \left( btf_{ik} - btf_{jk} \right)^{2} \right]} \end{aligned} \qquad (39)$

For the optimal transportation (OT) distance, the invention proposes applying the pairwise PNDF weighting to the normalized word frequency vectors X=[x₁, x₂, . . . , x_(n)] and Y=[y₁, y₂, . . . , y_(n)]. The weighted vectors are then normalized once more, and the corresponding OT distance optimization problem is solved using a standard linear program solver or a numerical approximation. For example, using the PNDF weighting scheme gives X_(i)=x_(i)PNDF_(XY)(w_(i)) and Y_(j)=y_(j)PNDF_(XY)(w_(j)). Here X_(i) and Y_(j) must be re-normalized so that the coordinates of each of the vectors X and Y sum to unity.
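The weighting and re-normalization step that precedes the OT solve can be sketched as below. This is only an illustration under stated assumptions: the pairwise weights are passed in as a plain list with made-up values, the function name is hypothetical, and the OT linear program itself is not solved here.

```python
from typing import List, Sequence


def reweight_and_normalize(freqs: Sequence[float],
                           pair_weights: Sequence[float]) -> List[float]:
    """Multiply each coordinate by its pairwise weight, then rescale to sum to 1."""
    weighted = [f * w for f, w in zip(freqs, pair_weights)]
    total = sum(weighted)
    if total <= 0:
        raise ValueError("weighted vector has no positive mass to normalize")
    return [v / total for v in weighted]


# Normalized word frequency vectors of two documents (toy values).
x = [0.5, 0.25, 0.25]
y = [0.0, 0.5, 0.5]
# Made-up pairwise weights for the document pair (X, Y), one per word.
pair_weights = [1.0, 2.0, 1.0]

X = reweight_and_normalize(x, pair_weights)
Y = reweight_and_normalize(y, pair_weights)
```

The resulting X and Y each sum to one, so they can serve as the two marginal distributions handed to a linear program solver or an approximate OT routine.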

The invention also proposes an integrated distance that combines the pairwise PNDF based BTF Euclidean distance with the standard OT distance above. Note that BTF-PNDF gives X_(i)=btfx_(i)PNDF_(XY)(w_(i)) and Y_(j)=btfy_(j)PNDF_(XY)(w_(j)), where btfx_(i) and btfy_(j) denote the binary versions of the original features x_(i) and y_(j).
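A minimal sketch of the integrated distance follows. It assumes the two parts are combined by simple addition, as stated in claim 11, and that the OT distance has already been computed elsewhere; the function names, the weights and the OT value are placeholders for illustration only.

```python
import math
from typing import Sequence


def btf_pairwise_euclidean(x: Sequence[float],
                           y: Sequence[float],
                           pair_weights: Sequence[float]) -> float:
    """Euclidean distance between the BTF vectors scaled by the pairwise weights."""
    total = 0.0
    for a, b, w in zip(x, y, pair_weights):
        bx = 1 if a > 0 else 0  # binary version of feature x_i
        by = 1 if b > 0 else 0  # binary version of feature y_i
        total += (w * (bx - by)) ** 2
    return math.sqrt(total)


def integrated_distance(ot_distance: float,
                        x: Sequence[float],
                        y: Sequence[float],
                        pair_weights: Sequence[float]) -> float:
    """Add the BTF based Euclidean distance to a precomputed OT distance."""
    return ot_distance + btf_pairwise_euclidean(x, y, pair_weights)


# Placeholder OT distance and made-up pairwise weights (illustration only).
result = integrated_distance(0.3, [0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [2.0, 1.0, 1.0])
```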

When using the PNDF weighting scheme in scenarios such as those above, the invention proposes tuning the parameter y in the PNDF expression over a non-negative range to obtain the best performance when training machine learning tasks such as document classification. For example, given the computed pairwise document distances, one can follow with a K Nearest Neighbor (KNN) algorithm for classification: first use the training data to find the optimal parameters such as y, and then apply them to the test dataset.
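The tuning procedure can be sketched as a small grid search. Several choices here are assumptions made for illustration and are not prescribed by the text: leave-one-out KNN accuracy on the training set is used as the selection criterion, the candidate grid is arbitrary, and `toy_distance_matrix` is a hypothetical stand-in that is not the PNDF distance.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]


def knn_label(i: int, dist: Matrix, labels: Sequence[str], k: int) -> str:
    """Majority label among the k nearest other training documents of document i."""
    others = [j for j in range(len(labels)) if j != i]
    nearest = sorted(others, key=lambda j: dist[i][j])[:k]
    return Counter(labels[j] for j in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(dist: Matrix, labels: Sequence[str], k: int) -> float:
    hits = sum(knn_label(i, dist, labels, k) == labels[i] for i in range(len(labels)))
    return hits / len(labels)


def tune_parameter(candidates: Sequence[float],
                   distance_matrix_for: Callable[[float], Matrix],
                   labels: Sequence[str],
                   k: int) -> Tuple[float, float]:
    """Return the first candidate with the highest leave-one-out accuracy."""
    best_value, best_accuracy = candidates[0], -1.0
    for value in candidates:
        accuracy = leave_one_out_accuracy(distance_matrix_for(value), labels, k)
        if accuracy > best_accuracy:
            best_value, best_accuracy = value, accuracy
    return best_value, best_accuracy


# Toy training set: two features per document, two classes.
points = [(0.0, 0.0), (0.1, 1.0), (1.0, 0.0), (1.1, 1.0)]
labels = ["a", "a", "b", "b"]


def toy_distance_matrix(y_param: float) -> Matrix:
    # Hypothetical stand-in distance: the second feature is scaled by y_param.
    return [[abs(p[0] - q[0]) + y_param * abs(p[1] - q[1]) for q in points]
            for p in points]


best_y, best_acc = tune_parameter([0.0, 0.5, 2.0], toy_distance_matrix, labels, k=1)
```

The selected value would then be held fixed when computing distances for the test documents.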

At the sentence level, using SNDF weighting gives X_(i)=x_(i)SNDF_(XY)(w_(i)) and Y_(j)=y_(j)SNDF_(XY)(w_(j)). Similarly, using SPNDF or its variant weighting gives X_(i)=x_(i)SPNDF_(XY)(w_(i)) and Y_(j)=y_(j)SPNDF_(XY)(w_(j)). One more re-normalization is again needed before solving the corresponding optimization problem. The rest can be computed in the standard OT framework as above.
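For sentence features, the weight of a sentence is obtained by summing the weights of its tokens, as stated in claims 5 and 16. The sketch below illustrates that step followed by the re-normalization; the token weights are made-up numbers standing in for pairwise token-level weights, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def sentence_weight(sentence_tokens: Sequence[str],
                    token_weight: Dict[str, float]) -> float:
    """Weight of a sentence feature = sum of the weights of its tokens."""
    return sum(token_weight[t] for t in sentence_tokens)


def weighted_sentence_masses(sentences: Sequence[Sequence[str]],
                             sentence_freqs: Sequence[float],
                             token_weight: Dict[str, float]) -> List[float]:
    """Scale each sentence frequency by its sentence weight, then normalize to sum 1."""
    weighted = [f * sentence_weight(s, token_weight)
                for s, f in zip(sentences, sentence_freqs)]
    total = sum(weighted)
    if total <= 0:
        raise ValueError("weighted sentence vector has no positive mass")
    return [v / total for v in weighted]


# Made-up token weights (illustration only).
token_weight = {"the": 0.1, "cat": 0.9, "sat": 0.5, "dog": 1.0}
sentences = [["the", "cat", "sat"], ["the", "dog"]]
masses = weighted_sentence_masses(sentences, [0.5, 0.5], token_weight)
```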

While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example and not limitation, as a skilled person can readily make various changes in form and detail therein without departing from the spirit and scope of the invention. It is also to be understood that the following claims are intended to cover all of the generic and specific features of the invention herein described and all statements of the scope of the invention which, as a matter of language, might be said to fall therebetween.

What is claimed:
 1. A novel pairwise document frequency weighting method, Negative Document Frequency (NDF) for a corpus of documents, comprising: choosing the intended feature set of documents; performing a feature token or symbol counting for each pair of documents, with the count value being 0, 1 or 2; selecting parameters γ₁, γ₂ and y; assigning a weighting value using the defining formula with the selected parameters.
 2. The method of claim 1, wherein its symmetric dual pairwise Positive Document Frequency (PDF) comprising: two parameters γ₁ and γ₂; three cases respectively for the feature count value 0, 1, and 2; the values are described precisely in equation (6); its relation to NDF is given in equations (10).
 3. The method of claim 1, further comprising: summing with its dual PDF above gives the integrated comprehensive pairwise Positive Negative Document Frequency (PNDF) document frequency.
 4. The method of claim 3, further comprising: computing the global normalized form PNDF across the corpus as the average of all pairwise PNDF by iterating the corpus; multiplying the feature term frequencies with corresponding PNDFs to get the TF-PNDF document representation vectors; computing the pairwise Euclidean distances among the documents.
 5. The method of claim 4, wherein the feature is a slightly complex structure such as a sentence rather than the simple discrete token or symbol, further comprising: computing the token weights first and sum them up as the assigned weight for the complex sentence feature.
 6. The Strict Proper Score based Positive Document Frequency comprising: three cases respectively for the feature count value 0, 1, and 2; each case's value is given as the inverse of logarithm of each case's probability; the expression has two parameters γ₁ and γ, which can be described by the document frequency and corpus size using equation (15); computing the normalized forms by iterating the corpus and averaging all the values.
 7. The method of claim 6, further comprising: computing its dual NDF using equation (17); computing the normalized forms by iterating the corpus and averaging all the values; further computing the sum of the normalized PDF and NDF.
 8. The method of claim 7, further comprising: applying the pairwise PNDF weighting to each pair of documents for the optimal transportation based word token or sentence moving distance for machine learning tasks such as classification and prediction; applying the normalized PNDF weighting to each document for the Euclidean document distance for machine learning tasks such as classification and prediction.
 9. The Binary Term Frequency (BTF) based document frequency method comprising: mapping the standard term frequencies of a document to the binary indicator of feature presence; selecting a document frequency such as IDF or normalized PNDF to multiply with; obtaining a BTF-PNDF type document representation vector; obtaining a BTF-IDF type document representation vector.
 10. The method of claim 9, further comprising: computing the Euclidean distance between documents using their normalized BTF-PNDF based representation vectors; computing the Euclidean distance between documents using their BTF-IDF based representation vectors.
 11. The method of claim 9, further comprising: selecting a pairwise document frequency PNDF and computing the pairwise BTF-PNDF representation vectors; adding the computed BTF based Euclidean distance above to the optimal transportation distance computed in claim 8.
 12. The method of claim 11, further comprising: adding the TF-PNDF based Euclidean distance computed in claim 4 to obtain the integrated document distance.
 13. A document distance computing system comprising: a server, including a processor and a memory, to: accept inputs as a collection of documents; select a type of feature which can be a discrete token or symbol; compute the feature frequency counts for each document and normalize them to a unit vector; select a type of document frequency weighting such as PNDF and then compute the TF-PNDF document representation vectors; compute the document distance for each pair of documents in the corresponding framework, where it could be the classical Euclidean document distance or the optimal transportation based word token moving distance.
 14. The system of claim 13, wherein the server adds the suitable pairwise BTF-PNDF based Euclidean distance to the optimal transportation distance; the server adds the suitable normalized BTF-PNDF or BTF-IDF based Euclidean distance to the classical Euclidean distance.
 15. The system of claim 14, wherein the server's outputs may be followed by applying a standard procedure such as K Nearest Neighbor (KNN), Support Vector Machine (SVM), Boosting Decision Trees or some Neural Network models for classification or prediction tasks.
 16. The system of claim 14, wherein the server uses the slightly complex features such as sentences or short phrases rather than discrete tokens. For the sentence-like structure features, the server sums the corresponding individual document frequency weights of each token in the sentence-like features.
 17. The system of claim 13, wherein the document distance uses the optimal transportation, the server uses the memory to store the word vectors for the vocabulary; and the server computes the pairwise word vector distance as the transportation cost of moving a word unit to another word unit in the word pair. The server then uses standard linear program solver for the optimal transportation plan estimation. 